In this project, you will use R and apply exploratory data analysis techniques to examine relationships among one or more variables and to explore a specific data set for distributions, outliers, and anomalies.
Exploratory Data Analysis (EDA) is the numerical and visual examination of the characteristics of a data set and the relationships within it, using formal methods and statistical strategies.
EDA can yield insights that lead to new questions and, eventually, to predictive models. It is an important "line of defense" against bad data and an opportunity to check whether your assumptions or intuitions about a data set are being violated.
This analysis explores a red wine data set [Cortez et al., 2009], originally built to model wine quality as reflected by the chemical properties of each wine. A friend with a degree in chemistry helped me identify chemical properties that can give a wine an unpleasant taste, and those hypotheses guide my analysis.
To begin, we look at each variable separately to get a sense of what we are dealing with:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
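The two printouts above have the shape of `summary()` and `str()` output for a data frame. As a minimal sketch, assuming the data frame is called `wine` (the name that appears in the model calls later on), the snippet below rebuilds a tiny stand-in from the first five rows shown in the `str()` printout and calls both functions on it. The stand-in keeps only a few columns and is for illustration only; the real analysis would read the full 1599-row file.

```r
# Toy data frame typed in from the first five rows of the str() printout;
# the full data set has 1599 rows and would normally be read from a file.
wine <- data.frame(
  X                = 1:5,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L)
)

summary(wine)  # per-column minimum, quartiles, mean and maximum
str(wine)      # column types and the first few values of each column
```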
The data are well formatted and, although some columns appear to contain outliers, nothing looks out of the ordinary.
First we remove the index column, which is not needed.
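One way to drop the index column, sketched on a small stand-in data frame (the column name `X` comes from the printout above; the stand-in itself is only for illustration):

```r
# Stand-in with the index column X and two real columns,
# using the first five rows from the str() printout.
wine <- data.frame(
  X             = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  quality       = c(5L, 5L, 5L, 6L, 5L)
)

# Assigning NULL to a column removes it from the data frame.
wine$X <- NULL

names(wine)
```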
We start with the quality variable:
Although grades from 0 to 10 are possible, the data contain grades only in the range 3 to 8, with a peak at 5 and few examples at the extremes. Let's look in more detail:
Worst wines:
## [1] 63
Best wines:
## [1] 217
Only 18 wines received the highest grade from the judges, and low-quality wines are also poorly represented. We will return to this point later.
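The text does not say which grades count as "worst" and "best". The sketch below assumes cut-offs of 4 or lower and 7 or higher, which is my assumption rather than something stated in the report, and shows how such counts can be obtained. It uses only the ten quality values visible in the `str()` printout, so the numbers it produces are not the 63 and 217 reported for the full data set.

```r
# First ten quality grades from the str() printout (illustration only).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Frequency of each grade.
table(quality)

# Assumed cut-offs: "worst" means grade <= 4, "best" means grade >= 7.
n_poor <- sum(quality <= 4)
n_good <- sum(quality >= 7)

c(poor = n_poor, good = n_good)
```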
Next we look at alcohol content.
The most common alcohol content is around 9.4, and the distribution is quite irregular (perhaps bimodal). It may be worth creating subsets for the different wine qualities to analyze this further.
The picture is not very clear because there are so few good wines in the sample, but better wines appear to have more alcohol than poor ones. My guess is that better wines accumulate more alcohol because of their longer fermentation time, but a regression analysis is needed to state this with more confidence.
Next we look at the residual sugar in our wines.
Because the distribution is heavy-tailed, we should use finer resolution on the x axis and more bins to see it better.
There is a peak around 2, so let's look at that region.
In this interval the data look approximately normally distributed, and this is where most wines fall. In the other regions we may find outliers in terms of wine quality, since very sweet wines tend to be considered poor.
Now let's return to the heavy-tailed distribution. To handle it, we rescale the x axis logarithmically.
Much better. We can now see a small peak for values above 10.
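A rough sketch of the log rescaling in base R, using the ten residual sugar values visible in the `str()` printout. The original plots were not preserved, so the plotting function and bin count here are my choices, not the author's; with ggplot2 the same effect is usually obtained with `scale_x_log10()`.

```r
# First ten residual sugar values from the str() printout (illustration only).
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# A base-10 log compresses the long right tail.
log_sugar <- log10(residual.sugar)

# Bin the transformed values; plot = FALSE returns the bins without drawing.
# Call plot(h) to draw the histogram.
h <- hist(log_sugar, breaks = 10, plot = FALSE)
h$counts

# The mean is pulled up by the large value; the back-transformed
# mean of the logs (the geometric mean) sits closer to the median.
c(mean      = mean(residual.sugar),
  geometric = 10^mean(log_sugar),
  median    = median(residual.sugar))
```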
Now let's look at residual sugar in the wines at the quality extremes:
Both modes are at 2, but the poor wines have outliers many standard deviations from the mean (around 13), and both distributions are heavy-tailed.
Chlorides indicate the salinity of a wine; in excess they spoil it.
This distribution is heavy-tailed as well, so we apply a log transformation.
As can be seen, most values are concentrated between 0.07 and 0.09, with outliers to the left and right.
Let's see how those wines are rated:
The wines with low salinity received high grades, which is interesting.
Next is pH, which describes how acidic or basic the wine is on a scale from 0 to 14.
Here we see a well-centered, approximately normal distribution. Let's look at how it relates to wine quality.
No significant difference between the wines is visible.
Density depends on the amount of alcohol and residual sugar. Let's see what this distribution looks like.
Nothing unusual here, but let's see how it relates to quality.
There is no significant separation between the distributions.
This is one of the main characteristics of a wine's flavor, and perhaps the most interesting variable in the data.
The data have a very odd distribution, and it is not clear how best to analyze them, but as expected this characteristic differs markedly between wines. Let's take a closer look between the peaks:
Now let's look at the concentration for good and poor wines separately:
For poor wines the distribution is sparse and heavy-tailed, centered to the left, whereas for good wines it is perhaps bimodal.
Sulphates are added to wine to control aspects of production and do not affect the final product much.
With a heavy tail once again, we apply a log transformation.
We now have a more centered histogram with several peaks and a few outliers. I believe the peaks come from rounding, since the values span a small interval.
Again, we compare good and poor wines.
Poor wines have outliers with very high values, which may contribute to their very low grades.
Here we look at volatile acidity, which in excess can make a wine taste like vinegar.
This distribution has several outliers with very high values. I suspect those wines received poor grades, which is worth checking:
The plot shows that this is not a determining factor in wine quality, since the distributions cover the same interval.
Now for volatile acidity:
The distribution is quite irregular, which I again attribute to truncation of the recorded values. Next, the distributions for good and poor wines.
The distribution for good wines appears shifted to the left and less spread out.
We begin the SO2 analysis with free sulfur dioxide:
Almost all values are below 60, so let's zoom in.
The distribution peaks near 7.
Now comparing good and poor wines:
Poor wines show a wider distribution, but the good wines fall within the same interval.
Let's create the variable bound, which is total sulfur dioxide minus free sulfur dioxide:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 21.00 30.59 39.00 251.50
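The derived variable is a plain element-wise subtraction. A small sketch with the ten values visible in the `str()` printout; the name `bound.sulfur.dioxide` is hypothetical, since the report only calls the variable "bound":

```r
# First ten values of each column from the str() printout (illustration only).
free.sulfur.dioxide  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total.sulfur.dioxide <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Bound SO2 is what remains of the total once the free part is removed.
bound.sulfur.dioxide <- total.sulfur.dioxide - free.sulfur.dioxide

bound.sulfur.dioxide
summary(bound.sulfur.dioxide)
```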
Now we plot some histograms to see how it behaves.
There are some outliers, so let's take a closer look.
These turn out to be good-quality wines.
Comparing good and poor wines for this variable:
Good wines have a much higher peak and larger outliers as well.
Now we look at total sulfur dioxide:
This histogram shows two outlying points. Let's take a look at them.
They are the same good-quality wines we found for bound sulfur dioxide.
Now the comparison of the distributions for good and poor wines.
The poor wines histogram peaks at 109 and then at 189, whereas the excellent wines histogram shows two distinct peaks situated fairly close to each other - at 99 and 119. Also, poor wine samples are more spread out across the X axis, and the poor wines distribution seems to have a left tail.
The data set has 1599 records with 11 chemical variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol) plus the wine quality (from 0 to 10) reported by professionals in the field.
The feature of interest is wine quality, since the data set was built for statistical analysis of which factors influence it.
From the analysis so far, most factors contribute to wine quality, but pH, sulfur dioxide, and chlorides seemed the most interesting to me.
In the sulfur section, I created the variable bound sulfur, which is total sulfur dioxide minus free sulfur dioxide, that is, the sulfur bound to other molecules in the wine.
Several distributions had outliers or heavy tails. I examined the outliers in separate plots and applied a logarithmic transformation to the heavy-tailed distributions, which made them closer to normal and easier to analyze.
In this section we analyze the pairwise relationships between the features.
Only alcohol shows a notable correlation with wine quality, which at first discourages a deeper analysis. However, there are relationships involving more variables that are still hidden from us, as well as transformations that could make relationships linear.
Looking at the pairwise relationships, some variables are strongly related: density and fixed acidity, pH and fixed acidity, and bound and total sulfur dioxide. Other pairs, not listed here, are more weakly related, with both positive and negative correlations.
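Pairwise correlations like these can be computed with `cor()`. The sketch below uses four columns and the ten rows visible in the `str()` printout, so the coefficients it prints are not those of the full data set; Pearson correlation is the function's default and is assumed here.

```r
# Ten rows from the str() printout (illustration only).
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Matrix of Pearson correlations between every pair of columns.
cor_matrix <- cor(wine)
round(cor_matrix, 2)

# A single pair can be read off by name.
cor_matrix["pH", "fixed.acidity"]
```

Even in this tiny sample, pH and fixed acidity move in opposite directions, which matches the negative relationship described in the text.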
This pair has the strongest positive correlation.
In these two other plots the data are widely scattered, with no clear nonlinear relationship.
Here we see a positive correlation: the higher the alcohol content, the more likely the wine is to have a higher grade.
These two pairs show little correlation between the variables, and the points are concentrated near the origin.
Here the correlation indicates that denser wines contain less alcohol and less dense wines contain more.
Here too we see a weak correlation, between alcohol and residual sugar.
Here there are signs of a not very strong correlation between bound sulfur dioxide and the inverse of the alcohol content.
Here we have two strong negative correlations: pH with fixed acidity, and alcohol with chlorides.
Here we look at the distribution of each chemical property by grade:
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.700 7.150 7.500 8.360 9.875 11.600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.600 6.800 7.500 7.779 8.400 12.500
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.100 7.800 8.167 8.900 15.900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.700 7.000 7.900 8.347 9.400 14.300
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.900 7.400 8.800 8.872 10.100 15.600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 7.250 8.250 8.567 10.225 12.600
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.875 2.100 2.635 3.100 5.700
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.900 2.100 2.694 2.800 12.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 1.900 2.200 2.529 2.600 15.500
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.477 2.500 15.400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.200 2.000 2.300 2.721 2.750 8.900
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 1.800 2.100 2.578 2.600 6.400
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0610 0.0790 0.0905 0.1225 0.1430 0.2670
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.0 5.0 6.0 11.0 14.5 34.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 12.26 15.00 41.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 15.00 16.98 23.00 68.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 14.00 15.71 21.00 72.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 11.00 14.05 18.00 54.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 6.00 7.50 13.28 16.50 42.00
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 6.75 11.00 13.90 13.75 37.00
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 8.00 14.00 23.98 32.00 107.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 14.00 29.00 39.53 58.00 128.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 11.00 19.00 25.16 33.00 126.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 8.50 15.00 20.97 21.50 251.50
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 9.25 11.00 20.17 22.75 76.00
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 12.5 15.0 24.9 42.5 49.0
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 14.00 26.00 36.25 49.00 119.00
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 26.00 47.00 56.51 84.00 155.00
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 23.00 35.00 40.87 54.00 165.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 17.50 27.00 35.02 43.00 289.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 16.00 21.50 33.44 43.00 88.00
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9961 0.9976 0.9975 0.9988 1.0008
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0031
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0037
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0032
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.160 3.312 3.390 3.398 3.495 3.630
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.300 3.370 3.382 3.500 3.900
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.300 3.305 3.400 3.740
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.860 3.220 3.320 3.318 3.410 4.010
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.920 3.200 3.280 3.291 3.380 3.780
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.163 3.230 3.267 3.350 3.720
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
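The blocks above have the shape of `by()` output: one `summary()` per quality grade, separated by dashed lines. A minimal sketch of that call for a single variable, using the ten rows visible in the `str()` printout (so only grades 5, 6 and 7 appear):

```r
# Ten rows from the str() printout (illustration only).
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One summary() per quality grade.
by(wine$alcohol, wine$quality, summary)

# The same grouping reduced to a single number per grade.
median_by_grade <- tapply(wine$alcohol, wine$quality, median)
median_by_grade
```

`tapply()` is handy when only one statistic per grade is needed, for example to compare medians across grades.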
Next, we draw density plots split by quality.
In all these plots, we can clearly see a bimodal distribution for the best wines. I guess this effect is due to there being very few wine samples with grade 8 in the dataset (only 18), which take on just a few values. Our analysis might have benefited from a greater number of highest-quality wines, as we could have checked whether this pronounced bimodality comes from insufficient data or whether some other factors are at play.
As for the last density plot for residual sugar, the distributions seem quite skewed - and indeed, in the first section of this analysis, we’ve found out that the residual sugar distribution has a very heavy right tail. Let’s now try rebuilding the same density plot, but with the residual sugar variable log-transformed.
Now it becomes obvious this distribution is actually bimodal across all wine grades! Pretty curious finding that I can’t explain right away for the lack of the domain knowledge. It might even be a phenomenon peculiar to Portuguese wines - it’s really hard to tell without having more data handy.
The strongest positive correlation involving quality is quality vs alcohol (0.436). One particularly interesting thing here is that an upward trend (quality increases as alcohol content grows) holds true only for higher-quality wines, starting from the grade of 6; below this point the trend is actually downward: for wine samples graded 3-5, the lower the alcohol level, the better the wine. The median alcohol value of less than 12 indicates that a wine sample’s maximum score is 7, which might help us tell a good wine sample from a poor one.
The most pronounced negative correlation that has to do with our main feature is observed in the pair quality - density (-0.307). The general trend there is a downward one: with each grade, median density decreases a bit, with a notable exception of one group, wine samples of grade 5, whose median density (0.9970) is higher than that of grade 4 (0.9965). Much the same picture can be seen in quality vs bound SO2 (-0.218): grade 5 wines once again break the generally downward trend.
Another interesting pattern was discovered in the pair quality vs volatile acidity (-0.195): median values there seemed to change in a wave-like fashion from one grade to another, going up and down a few times.
One more curious finding was that the residual sugar distribution, which is highly skewed initially, when log-transformed and color-coded by quality, is actually bimodal across all the wine grades, from lowest to highest. As I said above, under the relevant plot, I might be lacking some specialist knowledge to draw the right conclusion based on this fact, or it might just be a peculiarity of Portuguese wines, red ones in particular.
Fun fact: positive correlations were dominated by density (3 occurrences out of 6), negative ones by alcohol, featured even more prominently (5 occurrences out of 6). Therefore, it’s only natural that these two features produced the most highly correlated pairs (which I’m talking about in more detail in the subsection below), and density had a part in both of them!
Among other things, total SO2 and bound SO2 turned out to be positively correlated with both density and residual sugar. As for negative correlations, some of the strongest relationships were observed in the following pairs: total SO2 and free SO2 vs alcohol; pH vs fixed acidity; alcohol vs residual sugar and chlorides.
Surprisingly enough, the two most pronounced correlations didn’t involve the main variable, quality, but instead featured density, which seems to be heavily dependent on both residual sugar and alcohol content. In the former case, the correlation is positive and equals 0.839; in the latter case, the features are negatively correlated (-0.78).
In the previous section, we used box plots to see how different variables are distributed across wine grades and scatter plots to discover interesting pairwise relationships between the features. This section allows us to take our analysis one step further by combining the two techniques and examining what relationships the features display (and how these relationships vary) across wine grades.
Let’s first take a look at a couple of scatter plots for the features that exhibited the strongest correlation, faceted by quality.
Looks like no surprises here. Scatter plots demonstrate the same trends across all wine quality grades: upward for density vs residual sugar and downward for density vs alcohol.
I wonder what plots would look like for less correlated features.
For the lowest-quality wines, alcohol doesn’t seem to be correlated with residual sugar at all, with a negative trend becoming more noticeable towards higher wine grades.
A somewhat similar picture here. In the case of the worst and best wines, alcohol and total SO2 are much less correlated (if correlated at all) than in wine samples of other grades, which all display a more prominent downward trend.
This time the weakest correlation between the features takes place with the best wine samples. In all other cases, an upward trend is obvious.
We’ll now build a pretty straightforward linear model to see how well it can predict wine quality based on the features we’ve analyzed.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + residual.sugar, data = wine)
## m3: lm(formula = quality ~ alcohol + residual.sugar + density, data = wine)
## m4: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity,
## data = wine)
## m5: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH, data = wine)
## m6: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH + sulphates, data = wine)
## m7: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity +
## pH + sulphates + free.sulfur.dioxide, data = wine)
##
## =========================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## -------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 1.882*** -42.884*** -24.273* -13.811 -0.150 2.280
## (0.175) (0.176) (12.051) (11.433) (11.858) (11.944) (12.107)
## alcohol 0.361*** 0.361*** 0.401*** 0.339*** 0.346*** 0.325*** 0.320***
## (0.017) (0.017) (0.020) (0.019) (0.019) (0.019) (0.020)
## residual.sugar -0.004 -0.026 -0.016 -0.015 -0.007 -0.003
## (0.013) (0.014) (0.013) (0.013) (0.013) (0.013)
## density 44.547*** 27.216* 17.881 3.630 1.209
## (11.990) (11.367) (11.702) (11.812) (11.975)
## volatile.acidity -1.359*** -1.272*** -1.154*** -1.160***
## (0.096) (0.099) (0.100) (0.100)
## pH -0.383** -0.303* -0.290*
## (0.119) (0.119) (0.119)
## sulphates 0.628*** 0.642***
## (0.104) (0.105)
## free.sulfur.dioxide -0.002
## (0.002)
## -------------------------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.227 0.233 0.319 0.324 0.339 0.340
## adj. R-squared 0.226 0.226 0.232 0.318 0.322 0.336 0.337
## sigma 0.710 0.711 0.708 0.667 0.665 0.658 0.658
## F 468.267 234.040 161.879 187.064 152.580 136.047 116.861
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1721.016 -1714.127 -1618.932 -1613.786 -1595.704 -1594.954
## Deviance 805.870 805.829 798.915 709.235 704.685 688.926 688.280
## AIC 3448.114 3450.031 3438.254 3249.864 3241.573 3207.408 3207.908
## BIC 3464.245 3471.540 3465.139 3282.127 3279.213 3250.425 3256.302
## N 1599 1599 1599 1599 1599 1599 1599
## =========================================================================================================================
The variables in this linear model account for about 34% of the variance in red wine quality (R-squared of 0.340 for m7 in the table above).
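The nested models m1 to m7 can be built one predictor at a time with `lm()` and `update()`. The sketch below fits only the first step and one extension on the ten rows visible in the `str()` printout, so its coefficients and R-squared values are for illustration only and will not match the table. How the original comparison table was produced is not shown in the report; base R's `summary()` and `AIC()` are used here as a stand-in.

```r
# Ten rows from the str() printout (far too few for a real model).
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Start with a single predictor, then add one more to the same formula.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + volatile.acidity)

# R-squared never decreases when a predictor is added to a nested model,
# so AIC (which penalizes extra parameters) is a useful second check.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
r2
AIC(m1, m2)
```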
The most prominent correlations we’ve discovered were in fact so strong that, when faceted by wine quality, the features displayed the same trends across all wine grades: for density vs residual sugar, the trend was always upward, for density vs alcohol always downward.
For other, less correlated features (alcohol vs residual sugar, alcohol vs total SO2, density vs total SO2), the trend across the wine grades was also the same, with the exception of the best or worst wines, or both, where the features showed little to no correlation.
Since the correlation between density and residual sugar was somewhat stronger than that between density and alcohol (0.839 vs -0.78), I was especially interested to see how residual sugar and alcohol were correlated and expected at least a slightly positive correlation. To my surprise, the correlation turned out to be strongly negative (-0.451, second strongest among negative correlations discovered); in fact, it was so strong that a downward trend manifested itself across 5 of the 6 wine grades represented in the dataset, the exception being grade 3, where the features showed no correlation at all.
Yes, I did create a linear model that makes a prediction based on 7 features from the dataset. Further increasing the number of features didn't yield any significant improvement, so I stopped at this value. Surprisingly enough, the model explains a mere 34% of the variance in the target variable, which is quality. It seems that wine quality is not well explained by its physico-chemical properties alone. Two things to note here: first, quality of prediction could be improved with more data (right now, there are only 1,599 samples); second, there are some other factors at play, so the model might have benefited from the addition of such variables as price of wine, region where it was produced, year it was produced and other things not related to wine chemistry. Trying out other models may also lead to better results. Say, I have a hunch that tree-based methods would do well in this case.
This box plot supports our finding that the strongest positive correlation involving our main variable of interest is quality vs alcohol (0.436). Interestingly, for lower wine grades we can actually observe a downward trend that reverses only from grade 5 onwards. Thus, for wines of up to grade 5, the lower the alcohol content, the better a wine tends to be; after that, wine quality increases roughly linearly with alcohol content.
Moreover, the median (and mean) alcohol content of the best wines looks markedly different from that of the worst wines, which can be used to tell a quality wine from a poor one more or less reliably.
When plotted unmodified, the residual sugar distribution is highly skewed and has a long right tail. When log-transformed, however, the distribution becomes bimodal. When I later color-coded the plot by wine grade, I saw that the distribution was in fact bimodal across all the wine grades. Intrigued by this phenomenon, I read a few specialized articles on residual sugar in wines, but couldn’t find a satisfying explanation. For lack of evidence to the contrary, I’m therefore inclined to think that it’s a regional trait specific to Portuguese wines.
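The effect of the log transform can be illustrated with a short base R sketch. The values below are simulated as a mixture of two groups purely to mimic a right-skewed variable with hidden bimodality; they are not the residual sugar measurements, and the group sizes and parameters are arbitrary assumptions.

```r
# Simulated stand-in values; NOT the actual residual sugar data.
set.seed(7)
sugar <- c(
  rlnorm(300, meanlog = log(1.5), sdlog = 0.30),  # assumed low-sugar group
  rlnorm(200, meanlog = log(10), sdlog = 0.35)    # assumed high-sugar group
)

# The log transform compresses the long right tail so both groups become visible.
log_sugar <- log10(sugar)

# Compute histogram bins without drawing; drop plot = FALSE to draw the plots.
h_raw <- hist(sugar, breaks = 30, plot = FALSE)
h_log <- hist(log_sugar, breaks = 30, plot = FALSE)

print(h_raw$counts)
print(h_log$counts)
```

On the raw scale most of the counts pile up in the first few bins, while on the log scale the two groups occupy separate ranges of bins.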
This faceted scatter plot illustrates the third strongest negative correlation discovered during the analysis - alcohol vs total SO2. Each subplot contains a line of best fit that visually reinforces the trend across wine grades. One interesting observation is that for the best and worst wines the features display little to no correlation, whereas for wines of grades 4 through 8 a clear downward trend appears. This might indicate that this particular combination of features is a poor candidate for predicting wine quality. Indeed, when I was building a linear model, alcohol turned out to be the best contributor to the overall quality of prediction, whereas total SO2 did nothing to improve it and was therefore not included in the resulting model.
The dataset I’ve analyzed contains information on almost 5,000 white wines across 11 variables, plus an output variable based on sensory data: a grade on a scale of 0 to 10 given to each wine sample by professional wine judges. The dataset is restricted to Portuguese wines and contains only their physico-chemical properties.
I began my analysis by building histograms of each feature to understand their distributions. Most turned out to be approximately normally distributed, with a few notable exceptions (residual sugar, for example), where I observed heavy skew and long tails. Log-transforming these variables helped me deal with the skew. I also defined thresholds for poor (grade 4 and under) and excellent (grade 8 and over) wines, then subset my dataset using these thresholds and plotted the distributions of individual features for poor and excellent wines side by side. This helped me see whether the distributions differed noticeably and identify a few candidates that could be useful in telling a low-quality wine from a better one.
I went on to explore pairwise relationships between the features and picked out the most strongly correlated pairs (both positively and negatively) to focus my analysis on. To my surprise, the main variable of interest - quality - wasn’t involved in any of the strongest correlations identified. I built a few scatter plots and included a line of best fit in each of them to see the general trend in the data points more clearly. Then I added a few box plots that reinforced my earlier findings and offered some new insights.
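The thresholding step can be sketched as follows. The data frame is simulated and the column names `quality` and `alcohol` are assumed; only the cut-offs (grade 4 and under, grade 8 and over) come from the text above.

```r
# Simulated stand-in data; NOT the actual wine dataset.
set.seed(3)
wines <- data.frame(
  quality = sample(3:9, 400, replace = TRUE),
  alcohol = rnorm(400, mean = 10.5, sd = 1.2)
)

# Subsets for the two extremes, using the thresholds from the text.
poor <- subset(wines, quality <= 4)
excellent <- subset(wines, quality >= 8)

# Alternatively, label every wine so the groups can be compared in one call.
wines$group <- cut(
  wines$quality,
  breaks = c(-Inf, 4, 7, Inf),
  labels = c("poor", "average", "excellent")
)

# Median alcohol content per group.
print(aggregate(alcohol ~ group, data = wines, FUN = median))
```

Because `cut()` closes intervals on the right by default, the breaks at 4 and 7 put grades 3 to 4 in "poor", 5 to 7 in "average", and 8 to 9 in "excellent".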
My greatest success was finding out that alcohol content was the most influential feature that could more or less reliably be used to differentiate between poor and excellent wines. Indeed, when I later built a linear model to predict a wine grade, this feature alone contributed over 70% to the overall prediction quality.
In the final part of my analysis, I used wine grades to color-code and facet a few plots that I’d built previously to see whether any variables reinforced each other across the wine grades. The main finding here was that in the two most strongly correlated pairs the correlation was so pronounced that the trend stayed the same across all wine grades: it was always upward for density vs residual sugar and downward for alcohol vs density. The situation was a bit different for more weakly correlated pairs: the trend did stay the same across most wine grades, but for the worst or best wines the features displayed little to no correlation (for example, alcohol vs total SO2), which signaled that these combinations were probably not the best predictors of wine quality. I tested these findings when building a linear model and excluded the worst contributors from the final version.
I also ran into a couple of obstacles along the way. First, I found that the residual sugar distribution, when log-transformed, is bimodal across all wine grades. I struggled to explain this phenomenon for some time and even read a few specialized articles on the topic, but have found no satisfactory explanation so far. So I’m inclined to believe it is specific to Portuguese wines, since that’s what I’ve been analyzing all along.
Another thing I had difficulties with was the linear model that I’d built. It was able to explain only 28% of the variance in wine quality, which I found to be a pretty poor result. At first, I thought I was doing something wrong and actually spent a couple of days trying to engineer new features and combine them in various ways (to no avail), but then I realized that some other factors were at play and physico-chemical properties alone were not enough of a quality predictor.
This realization leads me to suggestions on how to improve the analysis. First and foremost, more data would help. 5,000 wine samples is a reasonable start, but given the number of wines in the world, it’s just a drop in the ocean. Besides, the dataset is restricted to Portuguese wines, which significantly limits how well it represents the whole population. Second, as I mentioned above, there must be other features that heavily influence wine quality. Better results might have been obtained with information about the region where a wine was produced, the year it was produced, grape type, selling price and brand, to name a few. It might also be a good idea to test other kinds of models and see how they fare against each other. I suspect that more flexible models, such as SVMs or tree-based methods, could produce better results.